ICU-23347 Reduce construction time of RuleBasedNumberFormat#3917
ICU-23347 Reduce construction time of RuleBasedNumberFormat#3917grhoten wants to merge 1 commit intounicode-org:mainfrom
Conversation
| "%digits-ordinal:", | ||
| "-x: \u2212>>;", | ||
| "0: =#,##0=$(ordinal,few{de}other{ste})$;", | ||
| "%digits-ordinal:" | ||
| "-x: \u2212>>;" | ||
| "0: =#,##0=$(ordinal,few{de}other{ste})$;" |
There was a problem hiding this comment.
All of the RBNF resource bundles are being changed from an array to a single string. These were generated from a recent copy of CLDR. Some recent changes & fixes in the main branch were included. I ignored all other changes from CLDR for this pull request.
| offset(0) { | ||
| init(nullptr, UPLURAL_TYPE_CARDINAL, status); | ||
| : PluralFormat(Locale::getDefault(), status) | ||
| { | ||
| } |
There was a problem hiding this comment.
Reuse existing constructors to default specific fields.
| auto *decFmt = dynamic_cast<DecimalFormat *>(numberFormat); | ||
| if(decFmt != nullptr) { | ||
| const number::LocalizedNumberFormatter* lnf = decFmt->toNumberFormatter(status); | ||
| if (U_FAILURE(status)) { | ||
| return appendTo; | ||
| } | ||
| lnf->formatImpl(&data, status); // mutates &data | ||
| if (U_FAILURE(status)) { | ||
| return appendTo; | ||
| } | ||
| numberString = data.getStringRef().toUnicodeString(); | ||
| } else { | ||
| if (offset == 0) { | ||
| numberFormat->format(numberObject, numberString, status); | ||
| if (numberFormat != nullptr) { | ||
| auto *decFmt = dynamic_cast<DecimalFormat *>(numberFormat); | ||
| if(decFmt != nullptr) { | ||
| const number::LocalizedNumberFormatter* lnf = decFmt->toNumberFormatter(status); | ||
| if (U_FAILURE(status)) { | ||
| return appendTo; | ||
| } | ||
| lnf->formatImpl(&data, status); // mutates &data | ||
| if (U_FAILURE(status)) { | ||
| return appendTo; | ||
| } | ||
| numberString = data.getStringRef().toUnicodeString(); | ||
| } else { | ||
| numberFormat->format(numberMinusOffset, numberString, status); | ||
| if (offset == 0) { | ||
| numberFormat->format(numberObject, numberString, status); | ||
| } else { | ||
| numberFormat->format(numberMinusOffset, numberString, status); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This is an expensive formatting operation. If you're not going to actually use the formatted number, then why construct it at all! So we make numberFormat optional with the changes to this class.
| void | ||
| PluralFormat::initializeNumberFormat(UErrorCode& status) { | ||
| if (U_SUCCESS(status) && numberFormat == nullptr | ||
| && (msgPattern.countParts() == 0 || msgPattern.getPatternString().indexOf(u'#') >= 0)) | ||
| { | ||
| // The pattern may need to format a number later. | ||
| // Let's cache this expensive to use number format. | ||
| numberFormat = NumberFormat::createInstance(locale, status); | ||
| } | ||
| // either we failed, we use the existing one, or the pattern doesn't format a number. | ||
| } |
There was a problem hiding this comment.
This is an important optimization. It saves construction time, and formatting time. RBNF uses a lot of these objects, which uses a lot of heap. None of them use a plural format that uses a number. So don't construct it if it's impossible to use.
| UResourceBundle* rbnfRules = ures_getByKeyWithFallback(nfrb, rules_tag, nullptr, &status); | ||
| if (U_FAILURE(status)) { | ||
| ures_close(nfrb); | ||
| } | ||
| UResourceBundle* ruleSets = ures_getByKeyWithFallback(rbnfRules, fmt_tag, nullptr, &status); | ||
| if (U_FAILURE(status)) { | ||
| ures_close(rbnfRules); | ||
| ures_close(nfrb); | ||
| return; | ||
| } | ||
|
|
||
| UnicodeString desc; | ||
| while (ures_hasNext(ruleSets)) { | ||
| desc.append(ures_getNextUnicodeString(ruleSets,nullptr,&status)); | ||
| nfrb = ures_getByKeyWithFallback(nfrb, rules_tag, nfrb, &status); | ||
| nfrb = ures_getByKeyWithFallback(nfrb, fmt_tag, nfrb, &status); | ||
| if (U_SUCCESS(status)) { | ||
| UParseError perror; | ||
| init(ures_getUnicodeString(nfrb, &status), locinfo, perror, status); | ||
| } | ||
| UParseError perror; | ||
|
|
||
| init(desc, locinfo, perror, status); | ||
|
|
||
| ures_close(ruleSets); | ||
| ures_close(rbnfRules); |
There was a problem hiding this comment.
Reuse the resource bundle object, and use a single string instead of an array.
| // iterate through the characters... | ||
| UnicodeString result; | ||
| // Find the first semicolon followed by whitespace or another semicolon, | ||
| // or leading whitespace/semicolons. Everything before that point is already clean. | ||
| int32_t len = description.length(); | ||
| int32_t start = 0; | ||
| if (len > 0) { | ||
| UChar ch = description.charAt(0); | ||
| if (!PatternProps::isWhiteSpace(ch) && ch != gSemiColon) { | ||
| for (int32_t i = 0; i < len - 1; ++i) { | ||
| if (description.charAt(i) == gSemiColon) { | ||
| UChar next = description.charAt(i + 1); | ||
| if (PatternProps::isWhiteSpace(next) || next == gSemiColon) { | ||
| start = i + 1; | ||
| break; | ||
| } | ||
| } | ||
| } | ||
| if (start == 0) { | ||
| return; // No whitespace to strip anywhere. | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Copy the clean prefix, then strip whitespace from the rest. | ||
| UnicodeString result(description, 0, start); | ||
|
|
||
| int start = 0; |
There was a problem hiding this comment.
Don't copy a string, if it's already in an optimized format.
There was a problem hiding this comment.
Languages with a lot of grammatical cases tend to have longer strings.
There are some changes that can be done to reduce the construction time and heap usage of the RuleBasedNumberFormat.
The changes include:
rbnf/TestAllLocales. That’s about 6.5 MB. Some RBNF rulesets use a lot of plural format objects, and most won’t be referenced.Checklist